Impact Analysis of OCR Quality on Research Tasks in Digital Archives

نویسندگان

  • Myriam C. Traub
  • Jacco van Ossenbruggen
  • Lynda Hardman
چکیده

Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. This, however, would be important to assess whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that gives account of their susceptibility to specific OCRinduced biases and the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge situation on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Sociological Impact of Using Digital (Web-based) Analyses on Performance Measurement and Optimization of Digital Marketing among Young Managers (Case study: Digital-based Companies in Tehran)

This research aims to study the effect of using digital (web-based) analyses in performance measurement and optimization of digital marketing in digital-based companies in Tehran. The data collection tool was a researcher-made questionnaire. A panel of experts and supervisor were asked to measure the validity of the questionnaire. For reliability analysis of this tool, Cronbach’s alpha test was...

متن کامل

Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)

Document images produced by scanner or digital camera, usually suffer from geometric and photometric distortions. Both of them deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...

متن کامل

The Role of Digital Technologies as an Alternative for Face-to-Face Knee Rehabilitation: A Systematic Review

Background: Digital technologies, including mobile applications, websites, and wearable devices, like smartwatches are among the newest approaches in prevention, care, and treatment studies; they could provide public access to high-quality rehabilitation services. The current review study aimed to evaluate the effects of digital technologies for enhancing physical activity, as well as improving...

متن کامل

Hybred: An OCR Document Representation for Classification Tasks

The classification of digital documents is a complex task in a document analysis flow. The amount of documents resulting from the OCR retro-conversion (optical character recognition) makes the classification task harder. In the literature, different features are used to improve the classification quality. In this paper, we evaluate various features on OCRed and non OCRed documents. Thanks to th...

متن کامل

The Impact of Digital Government on Whistleblowing and Whistle-blower Protection: Explanatory Study

This paper focuses on the contribution of digital government (DGOV) to Whistleblowing (WB). While considerable efforts have been devoted to DGOV and WB separately, research work at the intersection of these two domains is very scarce; hence and a systematic DGOV for WB (DGOV4WB) research framework has yet to emerge. This paper aims to identify the potential issues in whistleblowing and explore ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015